Start

  • The numeric file has a few additional features, but they are not described, so I will go ahead with the file that has both numeric and categorical features.

EDA

Note: I have only done the EDA needed to answer the questions asked. I have not done any EDA for the purpose of feature engineering or feature selection.

Missingness

##          checking_account_status               duration_in_months 
##                                0                                0 
##                   credit_history                          purpose 
##                                0                                0 
##                    credit_amount           savings_account_status 
##                                0                                0 
##         present_employment_since installment_as_percent_of_income 
##                                0                                0 
##                 marital_sex_type            role_in_other_credits 
##                                0                                0 
##           present_resident_since                      assset_type 
##                                0                                0 
##                              age          other_installment_plans 
##                                0                                0 
##                     housing_type           count_existing_credits 
##                                0                                0 
##                  employment_type                 count_dependents 
##                                0                                0 
##                    has_telephone                is_foreign_worker 
##                                0                                0 
##                 is_credit_worthy 
##                                0

So, no missing data. Yayyy!

Credit Worthiness

Before exploring the relationship of the predictors with the target, let’s first clearly define the target.

Credit worthiness for a group of observations can be measured by the proportion Good/Total, i.e. the share of ‘Good’ observations in the group. The higher the proportion, the higher the credit worthiness.
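As a minimal sketch of this definition (written in Python for illustration, not the code used for this report), the Good/Total proportion per group could be computed as below. The function name `good_proportion_by_group` and the toy rows are assumptions made up for the example; only the column names and the ‘Good’/‘Bad’ labels come from the data.

```python
from collections import defaultdict

def good_proportion_by_group(records, group_key, target_key="is_credit_worthy"):
    """Return {group: share of 'Good' rows} for the given grouping column."""
    good = defaultdict(int)
    total = defaultdict(int)
    for row in records:
        group = row[group_key]
        total[group] += 1
        if row[target_key] == "Good":
            good[group] += 1
    return {group: good[group] / total[group] for group in total}

# Made-up rows for illustration only (not taken from the data set).
toy_rows = [
    {"credit_history": "A30", "is_credit_worthy": "Bad"},
    {"credit_history": "A30", "is_credit_worthy": "Good"},
    {"credit_history": "A34", "is_credit_worthy": "Good"},
    {"credit_history": "A34", "is_credit_worthy": "Good"},
    {"credit_history": "A34", "is_credit_worthy": "Bad"},
    {"credit_history": "A34", "is_credit_worthy": "Good"},
]

proportions = good_proportion_by_group(toy_rows, "credit_history")
for group, share in sorted(proportions.items()):
    print(group, share)
```

A group with a higher share is read as more credit worthy under this definition.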

Credit History

Question: Would a person with a critical credit history be more credit worthy?

Again, let’s first define what critical means. In the absence of a concrete definition, I will assume that ‘critical’ roughly means more existing credits, i.e. it increases from A30 to A34.

A more critical credit history has a positive association with credit worthiness.

Age

Q. Are young people more creditworthy?

The distributions overlap considerably, but there are more young people in “Bad” than in “Good”, which is also visible in the difference in means.

So, young people seem slightly less credit worthy.

But let’s break the age into groups to see finer details

“Bad” is quite low for the (34, 39] age group
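The grouping above uses right-closed intervals such as (34, 39]. A rough Python sketch of that kind of binning follows; the break points and the toy ages are hypothetical, since the actual breaks used for the plot are not listed here.

```python
from bisect import bisect_left
from collections import Counter

def age_bin(age, breaks):
    """Label the right-closed interval (lo, hi] that contains `age`."""
    if not breaks[0] < age <= breaks[-1]:
        raise ValueError(f"age {age} is outside ({breaks[0]}, {breaks[-1]}]")
    i = bisect_left(breaks, age)  # index of the first break >= age
    return f"({breaks[i - 1]}, {breaks[i]}]"

# Hypothetical break points and made-up (age, label) pairs, for illustration only.
breaks = [18, 24, 29, 34, 39, 49, 75]
toy = [(22, "Bad"), (36, "Good"), (39, "Good"), (34, "Bad"), (40, "Good")]

counts = Counter((age_bin(age, breaks), label) for age, label in toy)
for (interval, label), n in sorted(counts.items()):
    print(interval, label, n)
```

Because the intervals are closed on the right, an age of 39 falls in (34, 39] while an age of 34 falls in (29, 34].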

Credit Accounts

Q. Would a person with more credit accounts be more credit worthy?

I am assuming that more credit accounts is the same as “Number of existing credits at this bank”, i.e. ‘count_existing_credits’.

The data is too unreliable to say anything about the relationship between the number of credit accounts and credit worthiness.

Feature Engineering & Selection

As mentioned earlier, I didn’t do any EDA from a feature engineering perspective. So, there is no feature engineering.

For feature selection I have used Boruta, which in my experience is almost always the best feature selection technique. This is what the Boruta plot looks like:

Selected features are:

##  [1] "checking_account_status"          "duration_in_months"              
##  [3] "credit_history"                   "purpose"                         
##  [5] "credit_amount"                    "savings_account_status"          
##  [7] "present_employment_since"         "installment_as_percent_of_income"
##  [9] "role_in_other_credits"            "assset_type"                     
## [11] "age"                              "other_installment_plans"         
## [13] "housing_type"                     "employment_type"                 
## [15] "is_credit_worthy"

Modeling

Strategy

It is worse to class a customer as ‘Good’ when they are ‘Bad’, than it is to class a customer as bad when they are good.

Let ‘Good’ be the positive class, and ‘Bad’ be the negative class. So the above statement will translate to:

False Positives (FPs) are more expensive than False Negatives (FNs)

Such cases fall under the “Cost Sensitive Learning” strategy, and the following sub-strategies can be followed under it:

Strategy Options

  • Modeling Strategies for cost sensitive learning
    • Change cost function
      • Change the function itself
        • the main function
        • penalty component
      • Change function parameters
        • oversample positive class
          • synthetic sample generation (like SMOTE)
          • give more weight
        • undersample negative class
          • give less weight
    • Optimize thresholds that are used for converting output probabilities into class labels - valid only for models which output probabilities
    • Ensembling
  • Evaluation Strategies for Cost sensitive classification
    • Favour Precision over Accuracy or Recall
    • Give weights to different buckets in confusion matrix, and use that to construct a custom evaluation metric

Options that I will explore

Models

I will try the following three models:

  • Logistic Regression
  • Boosted Trees: GBM
  • Random Forest

Modeling Strategy

  • Optimize thresholds for all the models
  • Give more weight to the positive class and tune the weighting parameter; I will do this only for GBM, just to showcase it
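Threshold optimization can be sketched as a simple sweep: convert the predicted probability of ‘Good’ into a label at each candidate threshold and keep the threshold with the lowest cost. The Python sketch below assumes the cost being minimized is the weighted confusion-matrix cost defined in the Evaluation Strategy section; the function names and the toy probabilities are made up for illustration and are not the code behind this report.

```python
# Cost weights per confusion-matrix bucket ('Good' is the positive class).
WEIGHTS = {"TP": -0.4, "FP": 1.0, "FN": 0.2, "TN": 0.0}

def weighted_cost(actual, predicted):
    """Average weighted cost over all observations."""
    total = 0.0
    for a, p in zip(actual, predicted):
        if p == "Good":
            total += WEIGHTS["TP"] if a == "Good" else WEIGHTS["FP"]
        else:
            total += WEIGHTS["FN"] if a == "Good" else WEIGHTS["TN"]
    return total / len(actual)

def best_threshold(actual, prob_good, thresholds, cost_fn=weighted_cost):
    """Return (threshold, cost) with the lowest cost; predict 'Good' when prob >= threshold."""
    best = None
    for t in thresholds:
        predicted = ["Good" if p >= t else "Bad" for p in prob_good]
        cost = cost_fn(actual, predicted)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Made-up labels and model probabilities, for illustration only.
actual = ["Good", "Good", "Bad", "Good", "Bad", "Good"]
prob_good = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
thresholds = [i / 10 for i in range(1, 10)]

threshold, cost = best_threshold(actual, prob_good, thresholds)
print(threshold, cost)
```

In practice the sweep would be run on held-out or cross-validated predictions rather than on the training data, so that the chosen threshold is not tuned to noise.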

Evaluation Strategy

I will go with a Custom evaluation metric:

I have assigned the following weights to the buckets of the confusion matrix to penalize each bucket differently:

##           Reference
## Prediction Good Bad
##       Good -0.4   1
##       Bad   0.2   0

There is no particular reason for these exact values; only their relative differences matter, because they penalize FPs more than FNs. Plus, I am rewarding TPs (True Positives).

Now, the custom metric is just the normalized sum-product of these weights and the confusion matrix of the model. Let’s call it “credit_cost”.
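A minimal Python sketch of this metric follows, assuming that “normalized” means dividing by the total number of observations. The check uses an all-‘Good’ prediction on the training class counts (561 Good, 241 Bad, taken from the column totals of the train confusion matrices reported below). The function layout itself is an illustration, not the code used for this report.

```python
# Weights from the table above, keyed by (prediction, reference).
WEIGHTS = {
    ("Good", "Good"): -0.4,  # true positive, rewarded
    ("Good", "Bad"): 1.0,    # false positive, most expensive
    ("Bad", "Good"): 0.2,    # false negative
    ("Bad", "Bad"): 0.0,     # true negative
}

def credit_cost(confusion):
    """Sum-product of weights and counts, divided by the number of observations.

    `confusion` maps (prediction, reference) -> count.
    """
    total = sum(confusion.values())
    weighted = sum(WEIGHTS[cell] * count for cell, count in confusion.items())
    return weighted / total

# Predicting everybody as 'Good' on the training set.
all_good_train = {
    ("Good", "Good"): 561,
    ("Good", "Bad"): 241,
    ("Bad", "Good"): 0,
    ("Bad", "Bad"): 0,
}
baseline_cost = credit_cost(all_good_train)
print(round(baseline_cost, 4))
```

Lower values are better: a negative credit_cost means the reward for true positives outweighs the penalties for false positives and false negatives.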

Splitting

I have used an 80:20 train/test split. For validation, I will use cross-validation wherever required.

Baseline

I am taking the baseline to be predicting everybody as ‘Good’.

Train credit_cost

## Baseline Train Cost: 0.0206982543640898
## Baseline Train Precision: 0.699501246882793

Test credit_cost

## Baseline Test Cost: 0.0171717171717172
## Baseline Test Precision: 0.702020202020202

Logistic Regression

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  518 116
##       Bad    43 125
##                                          
##                Accuracy : 0.802          
##                  95% CI : (0.772, 0.829) 
##     No Information Rate : 0.7            
##     P-Value [Acc > NIR] : 0.0000000000343
##                                          
##                   Kappa : 0.484          
##                                          
##  Mcnemar's Test P-Value : 0.0000000112995
##                                          
##             Sensitivity : 0.923          
##             Specificity : 0.519          
##          Pos Pred Value : 0.817          
##          Neg Pred Value : 0.744          
##              Prevalence : 0.700          
##          Detection Rate : 0.646          
##    Detection Prevalence : 0.791          
##       Balanced Accuracy : 0.721          
##                                          
##        'Positive' Class : Good           
## 
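For reference, the summary statistics in this output follow directly from the four counts of the confusion matrix. The Python sketch below recomputes them for the Logistic Regression train matrix with ‘Good’ as the positive class; the function name `summarize` is made up for the example.

```python
def summarize(tp, fp, fn, tn):
    """Derive the usual summary statistics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # recall on the positive ('Good') class
    specificity = tn / (tn + fp)   # recall on the negative ('Bad') class
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),   # precision
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Counts from the Logistic Regression train confusion matrix above.
stats = summarize(tp=518, fp=116, fn=43, tn=125)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Pos Pred Value is the Precision referred to elsewhere in this report.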

Boosted Trees - GBM

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  543  16
##       Bad    18 225
##                                              
##                Accuracy : 0.958              
##                  95% CI : (0.941, 0.97)      
##     No Information Rate : 0.7                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.899              
##                                              
##  Mcnemar's Test P-Value : 0.864              
##                                              
##             Sensitivity : 0.968              
##             Specificity : 0.934              
##          Pos Pred Value : 0.971              
##          Neg Pred Value : 0.926              
##              Prevalence : 0.700              
##          Detection Rate : 0.677              
##    Detection Prevalence : 0.697              
##       Balanced Accuracy : 0.951              
##                                              
##        'Positive' Class : Good               
## 

Random Forest

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  535  52
##       Bad    26 189
##                                               
##                Accuracy : 0.903               
##                  95% CI : (0.88, 0.922)       
##     No Information Rate : 0.7                 
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.761               
##                                               
##  Mcnemar's Test P-Value : 0.00464             
##                                               
##             Sensitivity : 0.954               
##             Specificity : 0.784               
##          Pos Pred Value : 0.911               
##          Neg Pred Value : 0.879               
##              Prevalence : 0.700               
##          Detection Rate : 0.667               
##    Detection Prevalence : 0.732               
##       Balanced Accuracy : 0.869               
##                                               
##        'Positive' Class : Good                
## 

Comparison

credit_cost and Precision are in sync.

Train results are best for GBM. But it is overfitting, i.e. variance is high, so its results on test are not that great.

Test results are best for Random Forest. It has less variance than GBM, but its bias is higher.

It may seem that GBM is the better model, but we still haven’t seen the uncertainty (variance) in the results. The difference between train and test results gives some idea of it, but it is better to look at cross-validated results.

Not much difference here either. DRF (the Random Forest model) seems only slightly better, but that may change with fold assignment. For GBM, I tuned the positive-class upsampling but not the other hyperparameters; for DRF I did the exact opposite. So both models have a lot of scope for tuning, and I am not yet at a stage to pick the right model.

Important Features

We can look at the feature importance of either GBM or DRF, but DRF gives a better plot because it does not break categorical features into their classes, so we will use DRF.

The top-3 features are “checking_account_status”, “duration_in_months”, and “credit_amount”.

Profiling of best credit-worthy person

To profile a ‘Good’ credit worthy person as per the model, let’s explore the relationship of top predictors with the predicted class for the DRF model.

So, the best credit-worthy person would have the following profile:
  • checking_account_status is “A14”, i.e. no checking account
  • duration_in_months is less than 12 months, i.e. a year
  • credit_amount is less than 2k
  • credit_history is “A34”, i.e. critical account/other existing credits
  • purpose is “A43”, i.e. radio/television

This seems slightly unintuitive, but I will have to go into model explainability to get better insights, and time is currently too short for that.

Things to do in future

  • EDA driven Feature Engineering
  • EDA driven Feature Selection
  • Better tuning
    • with cross validation on credit_cost or Precision
    • Bayesian Optimization
  • Model Explainability